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Improved staggered fermion formulations are a popular choice for lattice QCD calculations. 
Historically, the algorithm used for such calculations has been the inexact R algorithm, which 
has systematic errors that only vanish as the square of the integration step-size. We describe 
how the exact Rational Hybrid Monte Carlo (RHMC) algorithm may be used in this context, 
and show that for parameters corresponding to current state-of-the-art computations it leads 
to a factor of approximately seven decrease in cost as well as having no step-size errors. 
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Staggered fermions are a popular choice for lattice 
QCD calculations: they are generally thought of as being 
much cheaper than other fermion formulations, and have 
a remnant of the continuum theory's chiral symmetry 
protecting them from additive mass renormalization. In 
order to overcome the infamous fermion doubling prob- 
lem, the spin degrees of freedom arc spatially separated 
reducing the 2 fermion "tastes" to four. To further re- 
duce these four tastes to a single quark (or doublet) the 
fourth (square) root of this kernel is taken. While this is 
a valid prescription in the continuum limit, it is unclear 
what unwanted side-effects this introduces to the lattice 
theory. At best this treatment is an algorithmic nuisance, 
at worst it renders the theory non-local, non-universal, 
and non-unitary rendering any calculations made using 
such a formulation questionable. 

The reduction from 16 to 4 fermions also introduces 
0(a 2 ) unphysical taste-mixing interactions, where a is 
the lattice spacing. There are various improved formu- 
lations of staggered fermions, all of which redefine the 
Dirac operator by some local averaging of the gauge field 
variables; the paths and coefficients of this averaging are 
chosen to minimize these unphysical interactions. The 
most popular formulation is the ASQTAD prescription, 
where the Dirac operator is constructed from a gauge co- 
variant derivative consistingof the sum of the product of 
1, 3, 5 and 7 link variables [l|. 

After integrating out the Grassman-valued quark 
fields, the 2+1 quark flavor QCD partition function is 
given by a functional integral over gauge fields, 

Z = J dUdet(M e )? det(M s )ie- SG , 

where S G is the gauge action (for ASQTAD fermions 
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this is usually the one-loop improved Symanzik action) 
and M. = — FJ +4m 2 is the fermion matrix, with F) being 
the staggered covariant derivative and m the quark mass. 
The determinants represent the (^)ight and (s)trange 
quark contributions to the vacuum. It is the matrix 
square root and fourth root that make the formulation 
so algorithmically problematic: the Hybrid Monte Carlo 
(HMC) algorithm commonly used for lattice calcula- 
tions with dynamical fermions not being directly appli- 
cable. 

The algorithm commonly used for performing these 
calculations is the Hybrid R algorithm Q . First the iden- 
tity dct.M = exptrln.M is used to write the action as 

S = S G - §trln.M* - itrlnM,, 

then a Gaussian-distributed "fictitious" momentum field 
7r, taking values in the Lie algebra sit(3) on links, is in- 
troduced so a Hamiltonian H — ^ir 2 + S may be defined. 
Configurations (jr, U) are then generated with probabil- 
ity proportional to e~ H by alternating two Markov steps, 
refreshment of the momenta from a Gaussian heatbath 
and evolution of the gauge fields following the MD (MD) 
trajectory defined by this Hamiltonian for time r. The 
MD trajectory is approximated by means of a numerical 
integrator with a step-size of St, so these steps give an 
ergodic Markov process with the desired fixed point up to 
step-size errors. Usually the second order leapfrog (2LF) 
integrator is used which requires a single force evalua- 
tion per step, and this leads to errors of 0(St 2 ) in the 
coefficients in the action. 

The fcrmionic contribution to the force acting on the 
gauge fields is approximated by means of a noisy estima- 
tor for each trace appearing in the action: this requires 
the introduction of an auxiliary "noise" field that is re- 
freshed after every update. Each fermion force evaluation 
requires that the inverse fermion matrix be applied to its 
auxiliary field, typically by means of a conjugate gradient 
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(CG) solver. The cost of the algorithm increases dramat- 
ically as the fermion mass is taken to zero, and the cost 
of the light quarks contribution dominates that of the 
strange. 

Additional 0(8t) errors are introduced by using noisy 
estimates of the fermionic force, regardless of the nu- 
merical integrator used. These may be made to vanish 
by refreshing the auxiliary field at an asymmetric time 
within each update step, dependent on the number of 
fermions; doing so leads to an integrator that is neither 
area-preserving nor reversible, and therefore the step-size 
errors cannot be removed by means of a Metropolis ac- 
ceptance test as with HMC. The R algorithm is thus 
necessarily an inexact algorithm with 0(St 2 ) errors. 

An alternative approach is to represent each fermionic 
determinant as a Gaussian integral over a bosonic 
"pscudofermion" field <j>: the resulting fermion action is 

S F = (j)\M t *4>i + 4>lMe 5 </> s . At this point the conven- 
tional HMC algorithm fails as one can neither evaluate 
the n th root of a matrix directly nor calculate its deriva- 
tive with respect to the gauge field as required for the 
MD evolution of the system. However, it is possible to 
replace the matrix kernel with an approximation that can 
be evaluated and differentiated. There are two obvious 
approximations: polynomial [3, Q and rational @, @|- 
Polynomial approximations do not require the explicit 
evaluation of a matrix inverse, but they typically have to 
be of much higher degree than the corresponding ratio- 
nal approximations with the same error: for example, a 
polynomial approximation to 1/A4 for an hermitian ma- 
trix M. is equivalent to computing the inverse by means 
of a Jacobi iteration scheme, and this is well-known to be 
far inferior to other iterative solvers such as CG. With 
this in mind, we focus on optimal rational approxima- 
tions (generally found using the Remez algorithm). We 
replace the square and fourth root kernels by rational ap- 
proximations, and continue as if we would for HMC, for 
which each complete update consists of a sequence of the 
following Markov steps 

• Momentum refreshment heatbath using Gaussian 
noise (P(n) oc e" 71 ""/ 2 ). 

• Pseudofermion refreshment (P{4>) oc Ai^r], where 
P(rj) oc e^ 2 / 2 , a = 2(4) for light (strange)). 

• MDMC update, built out of 

— A (MD) trajectory consisting of t/St steps. 

— A Metropolis accept/reject with probability 
P acc = min(l,e-' 5ff ). 

The key advantage over the R algorithm is that the 
Metropolis test at the end of the trajectory stochastically 
corrects any finite step-size errors introduced through the 
discrctized MD integration. 

It is necessary to generate the pseudofermion fields at 
the start of each trajectory from a heatbath. For the 
functions of interest, the roots and poles of optimal (in 
the minimax = = Chebyshev norm) rational approx- 
imations are real and non-degenerate: this motivates a 



fermionic action S F = 4K c {M e )] 2 <p e + (j)l[r™ c {M s )] 2 <t) a i 
where r™ c (x) « and r™ c (x) « .t~s are valid over 
the spectral bounds of the operator. These rational ap- 
proximations must in some sense be exact: typically this 
means the maximum error is either less than the toler- 
ance used in the solver or, more conservatively, less than 
the unit of least precision of double precision arithmetic 
(« 10" 15 ). 

When written in partial fraction form (we consider 
only the case where the numerator and denominator 
polynomial degrees are equal) r(x) = ao + X)fe=i x+ k /3 k ' 
where n is the degree of the approximation. It transpires 
that for the cases of interest all the ctfc and (3k are real and 
positive: this ensures numerical stability of the method. 

A multi-shift solver Q can be used to evaluate the ra- 
tional approximation with a cost similar to that of the 
inversion of the original matrix M, thus both the pseudo- 
fermion heatbath and the evaluation of the action for the 
Metropolis step can be carried out within a single Krylov 
space. A difficulty arises when the fermionic force has to 
be computed, as this seems to depend upon the deriva- 
tive of the square of the rational approximation r, which 
involves double poles; this is obviously undesirable, as 
naively it doubles the cost as compared to the R algo- 
rithm. However, there is no reason why the MD and MC 
approximations need be the same: one is free to choose 
the MD approximation to avoid the square, and also use 
a lower degree approximation should one so wish. Indeed, 
the optimal approximation to M. 2a is not the square of 
that for Ai a anyhow. The MD action is thus written 

Sf = 4rr(M e )^ + D (M.)&, (1) 

where the approximations r" D (x) m x~i and r™(x) m 
x~? are again valid over the appropriate spectral ranges. 

The derivative of each of the bilinear terms that appear 
in equation (TTJ) is written as 
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where Xf. = ^/uk(M + /3fc) _1 0. The calculation of M! 
must be done explicitly using the chain rule and repre- 
sents a considerable computational expense because of 
the complicated nature of the ASQTAD operator. The 
only dependence on the pseudofermion fields is in 
and even if there are many such fields one need only cal- 
culate M! once. For n > 3 it is more expensive to use 
matrix-vector operations @ than matrix-matrix opera- 
tions ([3|), hence we use the latter Q. 

The naive cost of RHMC is similar to that of the R 
algorithm at the same step-size: however, the actual cost 
is determined by the integration step-size that can be 
used, and the algorithms' respective autocorrelations. 

Before we present results from such a comparison, we 
describe a reformulation of the fermion action which 
greatly reduces the cost of RHMC. It has been shown that 
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"mass preconditioning" the fcrmion force leads to a large 
reduction in the computational cost 0]; the fermion ker- 
nel is multiplied by the inverse of a similar kernel with a 
larger mass, and the desired distribution is then recovered 
by also introducing a new pseudofcrmion field with this 
larger mass kernel. This is beneficial because the result- 
ing pseudofcrmion kernels are less singular, and thus lead 
to smaller forces acting on the gauge fields; this in turn 
permits the use of a larger integration step-size before 
the symplectic integrator becomes unstable. A multiple 
timescale numerical integrator where the mass precondi- 
tioned light pseudofermion force is evaluated less often 
than that of the new pseudofermion then gives further 
speed-up: when tuning such an algorithm the product of 
each step-size with the magnitude of the corresponding 
force contribution should be the same to a first approxi- 
mation. 

In previous work [To| mass preconditioning was ap- 
plied to 2 flavor Wilson fermion calculations, but neither 
to 2 + I quark flavors nor to staggered fermions, both 
of which lend themselves particularly well to this tech- 
nique because the strange quark can be used as a pre- 
conditioner, as shall now be explained. The 2 + 1 flavor 
staggered determinant may be written as 
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where the light quark has been "mass preconditioned" by 
the strange quark. The corresponding action is 
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where in the second line use has been made of the 
fact that the staggered quark mass is additive, M B = 
A4 e + Sm 2 with Sm 2 = 4(mf — m 2 ), so that the first term 
may be written in terms of a single rational function of 
the light quark kernel in equation . The cost of evalu- 
ating the action is almost unchanged since no extra fields 
have been introduced, rather better use is being made of 
the already present strange quark. Since the light quark 
is preconditioned by the strange quark we expect that its 
contribution to the total force will be small in comparison 
to the "triple strange" . This is advantageous because the 
triple strange force is much cheaper to evaluate since it 
is a rational approximation of the more massive strange 
quark kernel. Thus the hierarchy of the integrators is to 
have the gauge field force updates (which are cheap com- 
pared with fermionic force evaluations) on the smallest 
time scale, the triple strange on an intermediate scale, 
and the preconditioned light quark on the largest scale. 
We note that choosing the strange quark to precondition 
the light quarks is merely down to the former's presence: 
if the strange quark were not already present then the op- 
timal prccondittioncd would certainly have a difference 




FIG. 1: A comparison of the relative L x magnitude of the 
forces of the original 2 + 1 RHMC ASQTAD formulation, and 
that of the mass preconditioned one. In the former the light 
quark force is more than three times that of the strange con- 
tribution, in the latter it is two orders of magnitude smaller 
(V" = 4 4 , f3 = 6.76, m t = 0.01, m s = 0.05). 



mass. Figure Q] is an example of the effect the mass con- 
ditioning has on the forces: the vast reduction in mag- 
nitude of the light quark contribution is evident. A nice 
side effect of such mass preconditioning is that the ap- 
proximation degree required for the preconditioned light 
kernel is much less than for the original one because its 
kernel is much better conditioned. 

As well as mass preconditioning, there are other im- 
provements that can be made to the RHMC alg orithm. 
The n th root multiple pseudofermion trick [12j | can be 
used to allow a larger integration step-size at constant 
acceptance rate: here one pseudofermion with kernel 
M a is replaced by n pseudofermions each with kernel 
M a/n . This method removes the integrator instability, 
after which it becomes advantageous to use higher-order 
or improved integrators [l3[, e.g. , a second order mini- 
mum norm (2MN) integrator [TJ]. The tolerance of the 
MD solver can be tuned on a per-shift basis; in particu- 
lar the tolerances of least well-conditioned shifts can be 
loosened considerably with little affect on the acceptance 
rate. This leads to a large reduction in the number of 
CG iterations Q. 

In order to test the improved RHMC algorithm, a suit- 
ably challenging set of parameters were chosen, as speci- 
fied in Table U These parameters also represent a recent 
run of the R algorithm performed by the MILC collabo- 
ration, so a direct comparison is possible. 

From previous studies of the R algorithm it has been 
found empirically that choosing an integrator step-size 
St ~ %m t leads to step-size errors that are deemed to be 
negligible. On the other hand, deciding on the optimum 
step-sizes for multiple timescale RHMC algorithm is a 
more involved process. When constructing the multi- 
scale integrator, it was decided to evaluate the gauge 
force using a fourth order integrator pd| to minimize 
this part of the cost. After constructing the multiple 
timescale integrator, a single MD trajectory was used to 
measure the force contributions in order to estimate the 
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R [15] 



RHMC 



t/St 300 
(Hmd , %c) 
resMD (5 ■ KT 5 ) M (HT 4 ) a , (KT 5 ) f 

resMc 

iV CK 344,988 



5 S , e , 25gaug e 

(6,10)., (4,7), 



(io- 10 ) s , 

70,230 



N aopB 



5.96 x 10 1 



8.54 x 10 ld 
0.775(23) 



TABLE I: Parameters and performance: n denotes the ra- 
tional approximation degree, P acc the Metropolis acceptance 
rate, and res the CG solver residual. The subscripts s and I 
indicate the fermion flavor. N cs and iVf lopB are the total CG 
iterations and floating point operations per trajectory respec- 
tively. The computations were performed on a V = 24 3 .64 
lattice with (3 = 6.76, m £ = 0.005, m B = 0.05, and r = 1.0. 



step-size ratios. It was observed that the cost of the cal- 
culation was dominated by the triple strange; this in turn 
was dominated by the calculation of the matrix deriva- 
tive, and not the matrix inversion (this effect is exacer- 
bated by the tolerance loosening optimization) ; thus any 
further mass preconditioning would be of little benefit. 
To further reduce the algorithm cost, the n th root mul- 
tiple pseudofermion formulation was used for the triple 
strange, with 3 pseudofermions, this allowed for a large 
increase of step-size, so much so in fact that the optimal 
parameters were found to be those where the light and 
triple strange were all on the same time scale again, i.e., 
a two level integrator separating the fermion and gauge 
timcscalcs. It is interesting to note that the optimal algo- 
rithm for 2+1 flavour ASQTAD fermions requires both 
mass preconditioning and the n th root trick, emphasizing 
the opinion expressed in [l2| that these approcahes can 
be complimentary 

A summary of the optimal parameters and costs is 
given in Table |TJ Perhaps the most interesting result 
is the factor of 7 reduction in the number floating point 
operations per trajectory relative to the R algorithm. It 
must be emphasized this factor does not take into account 
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FIG. 2: An extrapolation of the plaquette in 8t comparing 
the exact RHMC result with the inexact R result [l5| . 



the additional benefit of using an exact algorithm over an 
inexact one, illustrated in figure O The autocorrelation 
lengths of the plaquette using the algorithms are simi- 
lar, indeed they are essentially the same after dividing 
by RHMC's acceptance rate, and this is expected to be 
true for all other autocorrelations too. 

In this letter, we have described the exact RHMC al- 
gorithm, and how this algorithm can be used to greatly 
accelerate the generation of lattice QCD gauge field en- 
sembles with dynamical ASQTAD fermions compared 
to current techniques using the inexact R algorithm. 
Through a switch of algorithm, a production job that 
takes a year becomes at worst a matter of months, while 
at the same time removing a source of systematic error. 
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